library(pprint)
library(haven)
library(Statamarkdown)
cd "C:\Users\slupp\OneDrive\Skrivebord\NTNU\Mehmet\PSY8003\Dummy_moderation_and_mediation"
// section 1: dummy variable regression
use flats.dta
// section 3: Mediation
*use workout.dta
library(dplyr)
setwd("C:/Users/slupp/OneDrive/Skrivebord/NTNU/Mehmet/PSY8003/Dummy_moderation_and_mediation")
flats <- readRDS("flats.rds")
R
# NB the variable is already coded as a factor, so this is only to show how it's done
# SO DO NOT RUN/OVERWRITE ORIGINAL DATASET
flats2 <- flats %>% mutate(location = factor(location,
levels = c(1, 2, 3, 4),
labels = c("centre",
"south",
"west",
"east")))
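A quick sanity check of the recode (a sketch; assumes the mutate above was run):

```r
# Frequencies per labelled level -- should match the Stata codebook below
table(flats2$location)
levels(flats2$location)
```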
STATA
location (unlabeled)
-------------------------------------------------------------------------------
type: numeric (long)
label: labels_location
range: [1,4] units: 1
unique values: 4 missing .: 0/95
tabulation: Freq. Numeric Label
34 1 centre
9 2 south
11 3 west
41 4 east
If our variable is coded correctly (as a factor), we can see that R automatically treats it as a (or rather, multiple) dummy variable(s)
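To see the dummy coding R creates behind the scenes, we can inspect the design matrix with base R's model.matrix() (a sketch, assuming flats is loaded):

```r
# One 0/1 column per non-reference level; centre (the first level) is
# absorbed into the intercept
head(model.matrix(~ location, data = flats))
```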
reg_price_location <- lm(flat_price ~ location, data = flats)
preg(reg_price_location)
Model fit: flat_price
# Color data frame (class colorDF) 6 x 3:
│SS │df │M. │M │R │R.
Model│2.95e+11│ 3│F(3, 91)│ 2.05│ R-squared│6.34e-02
Residual│4.36e+12│ 91│prob > F│ 0.1│adj R-squared│3.25e-02
Total│4.65e+12│ 94│ N│ 95│ MSE│2.19e+05
Coefficients: flat_price
# Color data frame (class colorDF) 6 x 4:
│ Coefficients│ std.error│ t-value│ p-value│ CI[2.5%]│ CI[97.5%]
(Intercept)│ 597721│ 37524│ 15.93│ <2e-16│ 523183│ 672258
locationsouth│ -172350│ 82021│ -2.10│ 0.0384│ -335275│ -9425
locationwest│ -83933│ 75897│ -1.11│ 0.2717│ -234692│ 66827
locationeast│ -97993│ 50752│ -1.93│ 0.0566│ -198805│ 2819
In STATA it is not that simple: by default, the regression below treats location as a single continuous predictor
Source | SS df MS Number of obs = 95
-------------+---------------------------------- F(1, 93) = 2.90
Model | 1.4083e+11 1 1.4083e+11 Prob > F = 0.0917
Residual | 4.5107e+12 93 4.8502e+10 R-squared = 0.0303
-------------+---------------------------------- Adj R-squared = 0.0198
Total | 4.6515e+12 94 4.9484e+10 Root MSE = 2.2e+05
------------------------------------------------------------------------------
flat_price | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
location | -28584.31 16775 -1.70 0.092 -61896.14 4727.521
_cons | 604303.4 49434.23 12.22 0.000 506136.9 702470
------------------------------------------------------------------------------
In STATA we need to put an i. in front of the variable (e.g., reg flat_price i.location); likewise, we can force a categorical variable to be treated as continuous using c.
Source | SS df MS Number of obs = 95
-------------+---------------------------------- F(3, 91) = 2.05
Model | 2.9488e+11 3 9.8294e+10 Prob > F = 0.1120
Residual | 4.3566e+12 91 4.7875e+10 R-squared = 0.0634
-------------+---------------------------------- Adj R-squared = 0.0325
Total | 4.6515e+12 94 4.9484e+10 Root MSE = 2.2e+05
------------------------------------------------------------------------------
flat_price | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
location |
south | -172350.2 82021.29 -2.10 0.038 -335275.4 -9424.993
west | -83932.71 75896.81 -1.11 0.272 -234692.4 66826.99
east | -97992.95 50751.9 -1.93 0.057 -198805.4 2819.474
|
_cons | 597720.6 37524.39 15.93 0.000 523183 672258.2
------------------------------------------------------------------------------
This works even if our variable lacks labels
Source | SS df MS Number of obs = 95
-------------+---------------------------------- F(3, 91) = 2.05
Model | 2.9488e+11 3 9.8294e+10 Prob > F = 0.1120
Residual | 4.3566e+12 91 4.7875e+10 R-squared = 0.0634
-------------+---------------------------------- Adj R-squared = 0.0325
Total | 4.6515e+12 94 4.9484e+10 Root MSE = 2.2e+05
------------------------------------------------------------------------------
flat_price | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
location |
2 | -172350.2 82021.29 -2.10 0.038 -335275.4 -9424.993
3 | -83932.71 75896.81 -1.11 0.272 -234692.4 66826.99
4 | -97992.95 50751.9 -1.93 0.057 -198805.4 2819.474
|
_cons | 597720.6 37524.39 15.93 0.000 523183 672258.2
------------------------------------------------------------------------------
cat(interpretation_price_location)
Mean Centre = 597 721$
Mean South = 425 370$ (p = .038)
Mean West = 513 788$
Mean East = 499 728$
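Each group mean is simply the intercept (the reference-group mean) plus that group's coefficient (a sketch, assuming the model above):

```r
# Recover the group means from the regression coefficients
b <- coef(reg_price_location)
b[["(Intercept)"]] + c(centre = 0,
                       south = b[["locationsouth"]],
                       west  = b[["locationwest"]],
                       east  = b[["locationeast"]])
# e.g. south: 597721 - 172350, roughly 425 370
```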
We can also see that the model as a whole is non-significant (p = .112), with an adjusted R^2 around 3%
One problem with this approach is that we’re only using planned contrasts, so we are unable to test the differences between all locations. We could simply change the reference group and re-run the regression:
flats$location <- relevel(flats$location, ref = "south")
reg_price_location <- lm(flat_price ~ location, data = flats)
preg(reg_price_location)
Model fit: flat_price
# Color data frame (class colorDF) 6 x 3:
│SS │df │M. │M │R │R.
Model│2.95e+11│ 3│F(3, 91)│ 2.05│ R-squared│6.34e-02
Residual│4.36e+12│ 91│prob > F│ 0.1│adj R-squared│3.25e-02
Total│4.65e+12│ 94│ N│ 95│ MSE│2.19e+05
Coefficients: flat_price
# Color data frame (class colorDF) 6 x 4:
│ Coefficients│ std.error│ t-value│ p-value│ CI[2.5%]│ CI[97.5%]
(Intercept)│ 425370│ 72934│ 5.83│ <0.001│ 280495│ 570245
locationcentre│ 172350│ 82021│ 2.10│ 0.0384│ 9425│ 335275
locationwest│ 88418│ 98345│ 0.90│ 0.371│ -106932│ 283767
locationeast│ 74357│ 80542│ 0.92│ 0.358│ -85630│ 234345
In STATA we can simply use an extension of the i. operator: ib#., where # indicates which value to use as our base level (i.e., our reference group), e.g., reg flat_price ib2.location
Source | SS df MS Number of obs = 95
-------------+---------------------------------- F(3, 91) = 2.05
Model | 2.9488e+11 3 9.8294e+10 Prob > F = 0.1120
Residual | 4.3566e+12 91 4.7875e+10 R-squared = 0.0634
-------------+---------------------------------- Adj R-squared = 0.0325
Total | 4.6515e+12 94 4.9484e+10 Root MSE = 2.2e+05
------------------------------------------------------------------------------
flat_price | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
location |
centre | 172350.2 82021.29 2.10 0.038 9424.993 335275.4
west | 88417.5 98344.59 0.90 0.371 -106932 283767
east | 74357.26 80542.46 0.92 0.358 -85630.44 234345
|
_cons | 425370.4 72934.3 5.83 0.000 280495.3 570245.4
------------------------------------------------------------------------------
Of note, we should adjust p-values when performing multiple comparisons; there are many ways of doing this, but we will not look into them today.
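For reference, base R's p.adjust() implements several common corrections; a minimal sketch using the three contrast p-values from the model above:

```r
# Holm (step-down) and Bonferroni adjustments of the planned-contrast p-values
p_raw <- c(south = 0.0384, west = 0.2717, east = 0.0566)
p.adjust(p_raw, method = "holm")
p.adjust(p_raw, method = "bonferroni")
```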
library(ggplot2)
flats %>% ggplot(mapping = aes(x = location,
y = flat_price,
colour = location)) +
geom_boxplot() +
geom_jitter(size = 0.1) +
stat_summary(fun = "mean",
geom = "point",
colour = "green") +
stat_summary(fun.data = "mean_cl_boot",
geom = "errorbar",
colour = "darkblue",
width = 0.2)
Since our models are additive (unless we add interaction terms), covariates can be both continuous and categorical
R
# Renaming energy-efficiency to make the output a bit more readable
flats2 <- flats %>% rename(e_effic = energy_efficiency)
reg_price_location_energy <- lm(flat_price ~ location + e_effic,
data = flats2)
preg(reg_price_location_energy)
Model fit: flat_price
# Color data frame (class colorDF) 6 x 3:
│SS │df │M. │M │R │R.
Model│5.12e+11│ 5│F(5, 76)│ 2.16│ R-squared│1.25e-01
Residual│3.60e+12│ 76│prob > F│ 0.07│adj R-squared│6.70e-02
Total│4.11e+12│ 81│ N│ 82│ MSE│2.18e+05
Coefficients: flat_price
# Color data frame (class colorDF) 6 x 6:
│ Coefficients│ std.error│ t-value│ p-value│ CI[2.5%]│ CI[97.5%]
(Intercept)│ 544392│ 99239│ 5.486│5.19e-07│ 346742│ 742043
locationcentre│ 199209│ 85100│ 2.341│ 0.0219│ 29717│ 368700
locationwest│ 97061│ 100220│ 0.968│ 0.3359│ -102545│ 296666
locationeast│ 66112│ 81137│ 0.815│ 0.4177│ -95486│ 227711
e_efficmediocre│ -112341│ 77756│ -1.445│ 0.1526│ -267206│ 42524
e_efficpoor│ -198577│ 85725│ -2.316│ 0.0232│ -369312│ -27842
Source | SS df MS Number of obs = 82
-------------+---------------------------------- F(5, 76) = 2.16
Model | 5.1228e+11 5 1.0246e+11 Prob > F = 0.0669
Residual | 3.5993e+12 76 4.7360e+10 R-squared = 0.1246
-------------+---------------------------------- Adj R-squared = 0.0670
Total | 4.1116e+12 81 5.0761e+10 Root MSE = 2.2e+05
------------------------------------------------------------------------------
flat_price | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
location |
south | -199208.5 85099.86 -2.34 0.022 -368699.6 -29717.43
west | -102147.8 82903.83 -1.23 0.222 -267265 62969.53
east | -133096 57940.96 -2.30 0.024 -248495.4 -17696.59
|
energy_eff~y |
mediocre | -112340.8 77756.19 -1.44 0.153 -267205.7 42524.08
poor | -198576.9 85724.61 -2.32 0.023 -369312.3 -27841.56
|
_cons | 743600.9 87636.03 8.49 0.000 569058.7 918143.2
------------------------------------------------------------------------------
R
reg_price_location_floor_size <- lm(flat_price ~ location + floor_size,
data = flats)
preg(reg_price_location_floor_size)
Model fit: flat_price
# Color data frame (class colorDF) 6 x 3:
│SS │df │M. │M │R │R.
Model│3.04e+12│ 4│F(4, 90)│42.35│ R-squared│6.53e-01
Residual│1.61e+12│ 90│prob > F│6e-20│adj R-squared│6.38e-01
Total│4.65e+12│ 94│ N│ 95│ MSE│1.34e+05
Coefficients: flat_price
# Color data frame (class colorDF) 6 x 5:
│ Coefficients│ std.error│ t-value│ p-value│ CI[2.5%]│ CI[97.5%]
(Intercept)│ -16494│ 57173│ -0.288│0.773633│ -130079│ 97091
locationcentre│ 182491│ 50204│ 3.635│0.000462│ 82752│ 282230
locationwest│ 103287│ 60199│ 1.716│0.089646│ -16309│ 222883
locationeast│ 136509│ 49548│ 2.755│0.007101│ 38074│ 234944
floor_size│ 5295│ 428│ 12.368│ < 2e-16│ 4445│ 6146
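With an additive model, a prediction is just the sum of the relevant terms. For example, a 100 m2 flat in the centre (a sketch; note that south is the reference group in this fit):

```r
# Predicted price = intercept + locationcentre + floor_size coefficient * 100
predict(reg_price_location_floor_size,
        newdata = data.frame(location = "centre", floor_size = 100))
# i.e. -16494 + 182491 + 5295 * 100, roughly 695 000
```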
STATA
Source | SS df MS Number of obs = 95
-------------+---------------------------------- F(4, 90) = 42.35
Model | 3.0377e+12 4 7.5942e+11 Prob > F = 0.0000
Residual | 1.6138e+12 90 1.7931e+10 R-squared = 0.6531
-------------+---------------------------------- Adj R-squared = 0.6376
Total | 4.6515e+12 94 4.9484e+10 Root MSE = 1.3e+05
------------------------------------------------------------------------------
flat_price | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
location |
south | -182490.9 50203.81 -3.64 0.000 -282229.5 -82752.28
west | -79203.74 46450.5 -1.71 0.092 -171485.8 13078.29
east | -45981.88 31343.61 -1.47 0.146 -108251.4 16287.69
|
floor_size | 5295.314 428.1549 12.37 0.000 4444.709 6145.918
_cons | 165996.8 41784.01 3.97 0.000 82985.57 249008
------------------------------------------------------------------------------
R
library(margins)
pred_price_location_size <- margins(reg_price_location_floor_size,
at = list(location = c("centre",
"south",
"west",
"east"),
floor_size = c(20,220)
)
)
pred_price_location_size %>% ggplot(mapping = aes(x = floor_size,
y = fitted,
colour = location)) +
geom_smooth(method = "lm", se = F)
STATA
Source | SS df MS Number of obs = 95
-------------+---------------------------------- F(4, 90) = 42.35
Model | 3.0377e+12 4 7.5942e+11 Prob > F = 0.0000
Residual | 1.6138e+12 90 1.7931e+10 R-squared = 0.6531
-------------+---------------------------------- Adj R-squared = 0.6376
Total | 4.6515e+12 94 4.9484e+10 Root MSE = 1.3e+05
------------------------------------------------------------------------------
flat_price | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
location |
south | -182490.9 50203.81 -3.64 0.000 -282229.5 -82752.28
west | -79203.74 46450.5 -1.71 0.092 -171485.8 13078.29
east | -45981.88 31343.61 -1.47 0.146 -108251.4 16287.69
|
floor_size | 5295.314 428.1549 12.37 0.000 4444.709 6145.918
_cons | 165996.8 41784.01 3.97 0.000 82985.57 249008
------------------------------------------------------------------------------
Adjusted predictions Number of obs = 95
Model VCE : OLS
Expression : Linear prediction, predict()
1._at : location = 1
floor_size = 20
2._at : location = 1
floor_size = 70
3._at : location = 1
floor_size = 120
4._at : location = 1
floor_size = 170
5._at : location = 1
floor_size = 220
6._at : location = 2
floor_size = 20
7._at : location = 2
floor_size = 70
8._at : location = 2
floor_size = 120
9._at : location = 2
floor_size = 170
10._at : location = 2
floor_size = 220
11._at : location = 3
floor_size = 20
12._at : location = 3
floor_size = 70
13._at : location = 3
floor_size = 120
14._at : location = 3
floor_size = 170
15._at : location = 3
floor_size = 220
16._at : location = 4
floor_size = 20
17._at : location = 4
floor_size = 70
18._at : location = 4
floor_size = 120
19._at : location = 4
floor_size = 170
20._at : location = 4
floor_size = 220
------------------------------------------------------------------------------
| Delta-method
| Margin Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
_at |
1 | 271903.1 34948.56 7.78 0.000 202471.6 341334.5
2 | 536668.7 23489.52 22.85 0.000 490002.7 583334.8
3 | 801434.4 28261.2 28.36 0.000 745288.6 857580.2
4 | 1066200 44296.92 24.07 0.000 978196.5 1154204
5 | 1330966 63579.25 20.93 0.000 1204654 1457277
6 | 89412.15 52251.76 1.71 0.090 -14395.08 193219.4
7 | 354177.8 45005.51 7.87 0.000 264766.5 443589.1
8 | 618943.5 47300.4 13.09 0.000 524973 712914
9 | 883709.2 58015.03 15.23 0.000 768452.2 998966.2
10 | 1148475 73557.74 15.61 0.000 1002340 1294610
11 | 192699.3 48001.31 4.01 0.000 97336.34 288062.3
12 | 457465 40630.68 11.26 0.000 376745.1 538184.9
13 | 722230.7 43751.13 16.51 0.000 635311.4 809149.9
14 | 986996.4 55624.22 17.74 0.000 876489.1 1097504
15 | 1251762 72045.32 17.37 0.000 1108631 1394893
16 | 225921.2 30454.41 7.42 0.000 165418.2 286424.2
17 | 490686.9 20925.63 23.45 0.000 449114.5 532259.3
18 | 755452.5 29408.76 25.69 0.000 697026.9 813878.2
19 | 1020218 46994.17 21.71 0.000 926856.1 1113580
20 | 1284984 66847.67 19.22 0.000 1152179 1417788
------------------------------------------------------------------------------
Variables that uniquely identify margins: floor_size location
reg_price_size_int_year <- lm(flat_price ~ floor_size * year_built, data = flats)
preg(reg_price_size_int_year)
Model fit: flat_price
# Color data frame (class colorDF) 6 x 3:
│SS │df │M. │M │R │R.
Model│2.79e+12│ 3│F(3, 75)│47.95│ R-squared│6.57e-01
Residual│1.45e+12│ 75│prob > F│2e-17│adj R-squared│6.44e-01
Total│4.25e+12│ 78│ N│ 79│ MSE│1.39e+05
Coefficients: flat_price
# Color data frame (class colorDF) 6 x 4:
│ Coefficients│ std.error│ t-value│ p-value│ CI[2.5%]│ CI[97.5%]
(Intercept)│ -53097803│ 1.44e+07│ -3.70│0.000410│-81694781│ -24500824
floor_size│ 615583│ 1.68e+05│ 3.67│0.000445│ 281824│ 949342
year_built│ 26541│ 7.16e+03│ 3.71│0.000398│ 12281│ 40800
floor_size:year_built│ -304│ 8.35e+01│ -3.64│0.000495│ -471│ -138
STATA
In STATA we recommend that one uses c. or i. to specify, in some
instances it can cause problems if they are not specified correctly
(e.g., try this code
reg flat_price floor_size##year_built)
Source | SS df MS Number of obs = 79
-------------+---------------------------------- F(3, 75) = 47.95
Model | 2.7905e+12 3 9.3017e+11 Prob > F = 0.0000
Residual | 1.4548e+12 75 1.9397e+10 R-squared = 0.6573
-------------+---------------------------------- Adj R-squared = 0.6436
Total | 4.2453e+12 78 5.4427e+10 Root MSE = 1.4e+05
------------------------------------------------------------------------------
flat_price | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
floor_size | 615582.8 167541.2 3.67 0.000 281823.7 949341.9
year_built | 26540.77 7157.913 3.71 0.000 12281.48 40800.06
|
c. |
floor_size#|
c.year_built | -304.3031 83.54934 -3.64 0.000 -470.7419 -137.8642
|
_cons | -5.31e+07 1.44e+07 -3.70 0.000 -8.17e+07 -2.45e+07
------------------------------------------------------------------------------
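Because of the interaction, the slope of floor_size depends on year_built: slope(year) = b_floor_size + b_interaction * year. A sketch, assuming the model fitted above:

```r
# Marginal effect of one extra square metre for flats built in a given year
b <- coef(reg_price_size_int_year)
slope_at <- function(year) b[["floor_size"]] + b[["floor_size:year_built"]] * year
slope_at(c(1930, 1970, 2010))
# roughly 615583 - 304 * year: the per-m2 premium shrinks for newer flats
```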
3D
persp(reg_price_size_int_year,
xvar = "floor_size",
yvar = "year_built",
theta = -30,
phi = 15)
library(plotly)
# Removing NA's
flats_NA_removed <- na.omit(flats)
# Creating a grid representing our regression plane
grid_size <- 30
seq_floor_size <- seq(min(flats_NA_removed$floor_size, na.rm = T),
max(flats_NA_removed$floor_size, na.rm = T),
length.out = grid_size)
seq_year_built <- seq(min(flats_NA_removed$year_built, na.rm = T),
max(flats_NA_removed$year_built, na.rm = T),
length.out = grid_size)
year_size <- expand.grid(floor_size = seq_floor_size,
year_built = seq_year_built)
predicted_price <- matrix(predict(reg_price_size_int_year,
newdata = year_size),
ncol = grid_size,
nrow = grid_size,
byrow = TRUE)
plot_ly(data = flats_NA_removed,
type = "scatter3d",
x = ~floor_size,
y = ~year_built,
z = ~flat_price,
color = ~location,
size = I(150)) %>%
add_surface(x = seq_floor_size,
y = seq_year_built,
z = predicted_price,
opacity = .3)
2D
library(margins)
pred_price_size_int_year <- margins(reg_price_size_int_year,
at = list(floor_size = c(20, 220),
year_built = seq(1930, 2010, 20)
)
)
pred_price_size_int_year <- pred_price_size_int_year %>%
mutate(year_built = factor(year_built,
levels = seq(1930, 2010, 20),
labels = as.character(
seq(1930, 2010, 20)
)
)
)
pred_price_size_int_year %>%
ggplot(mapping = aes(x = floor_size,
y = fitted,
colour = year_built)) +
geom_smooth(method = "lm", se = F)
STATA
reg flat_price c.floor_size##c.year_built
margins, at(floor_size = (20(200)220) year_built = (1930(20)2010))
marginsplot
Allowing for an interaction between a continuous and a categorical variable can, in essence, be thought of as running a separate regression, with its own intercept and slope, for each group (here: four locations).
This remains true when adding another continuous and/or categorical variable, but in that case all interaction terms have to be saturated (which quickly becomes complicated).
In this example we have a continuous variable X1 and two separate dummy variables X2 and X3: \(Y_i = \beta_0 + \beta_1X_{1i} + \beta_2 X_{2i} + \beta_3 X_{3i} + \beta_4 X_{1i} X_{2i} + \beta_5 X_{1i} X_{3i} + \beta_6 X_{2i} X_{3i} + \beta_7 X_{1i} X_{2i} X_{3i}\)
NB: We could simply use lm(Y ~ X1*X2*X3, data = df) in
R, and reg Y X1##X2##X3 in STATA
This would then be the same as running a separate regression for each unique group (i.e., each unique combination of X2 and X3).
R
reg_price_location_int_size <- lm(flat_price ~ location*floor_size, data = flats)
preg(reg_price_location_int_size)
Model fit: flat_price
# Color data frame (class colorDF) 6 x 3:
│SS │df │M. │M │R │R.
Model│3.08e+12│ 7│F(7, 87)│24.34│ R-squared│6.62e-01
Residual│1.57e+12│ 87│prob > F│5e-18│adj R-squared│6.35e-01
Total│4.65e+12│ 94│ N│ 95│ MSE│1.34e+05
Coefficients: flat_price
# Color data frame (class colorDF) 6 x 8:
│ Coefficients│ std.error│ t-value│ p-value│ CI[2.5%]│ CI[97.5%]
(Intercept)│ 125185│ 149184│ 0.839│ 0.4037│ -171333│ 421704
locationcentre│ 36572│ 156501│ 0.234│ 0.8158│ -274491│ 347635
locationwest│ -251538│ 243004│ -1.035│ 0.3035│ -734535│ 231459
locationeast│ 2662│ 167061│ 0.016│ 0.9873│ -329390│ 334714
floor_size│ 3597│ 1705│ 2.110│ 0.0378│ 208│ 6987
locationcentre:floor_size│ 1750│ 1779│ 0.984│ 0.3280│ -1786│ 5286
locationwest:floor_size│ 4341│ 2883│ 1.506│ 0.1358│ -1390│ 10072
locationeast:floor_size│ 1589│ 1980│ 0.802│ 0.4246│ -2347│ 5525
STATA
Source | SS df MS Number of obs = 95
-------------+---------------------------------- F(7, 87) = 24.34
Model | 3.0793e+12 7 4.3991e+11 Prob > F = 0.0000
Residual | 1.5721e+12 87 1.8071e+10 R-squared = 0.6620
-------------+---------------------------------- Adj R-squared = 0.6348
Total | 4.6515e+12 94 4.9484e+10 Root MSE = 1.3e+05
------------------------------------------------------------------------------
flat_price | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
location |
south | -36571.84 156501.1 -0.23 0.816 -347634.7 274491
west | -288109.7 197565.6 -1.46 0.148 -680792.6 104573.3
east | -33910.22 88828.6 -0.38 0.704 -210466.7 142646.2
|
floor_size | 5347.314 506.5162 10.56 0.000 4340.558 6354.069
|
location#|
c.floor_size |
south | -1749.891 1778.903 -0.98 0.328 -5285.654 1785.872
west | 2591.292 2379.662 1.09 0.279 -2138.544 7321.128
east | -161.2235 1127.105 -0.14 0.887 -2401.466 2079.019
|
_cons | 161757.3 47295.29 3.42 0.001 67752.75 255761.8
------------------------------------------------------------------------------
R: The Model
pred_location_int_floor <- margins(reg_price_location_int_size,
at = list(location = c("centre",
"south",
"west",
"east"),
floor_size = c(0,220)
)
)
Warning: A 'at' value for 'floor_size' is outside observed data range (20,228)!
R: Running separate regressions for each location
Here you can see we get the same plot, showing that it is indeed equivalent to running four separate regressions (in terms of the coefficients).
flats %>% ggplot(mapping = aes(x = floor_size,
y = flat_price,
colour = location)) +
geom_smooth(method = "lm", se = FALSE, fullrange = T)
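We can also compare the coefficients directly (a sketch, assuming flats and the interaction model above): the implied intercept and slope for one group equal those from a regression fitted to that group alone.

```r
# Separate regression for the "west" flats...
coef(lm(flat_price ~ floor_size, data = subset(flats, location == "west")))
# ...equals the interaction model's implied intercept and slope for west
b <- coef(reg_price_location_int_size)
c(intercept = b[["(Intercept)"]] + b[["locationwest"]],
  slope     = b[["floor_size"]]  + b[["locationwest:floor_size"]])
```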
STATA
reg flat_price i.location##c.floor_size
margins , at(floor_size = (20(200)220) location = (1(1)4))
marginsplot
Run a regression with flat price as a dependent variable and energy efficiency as categorical variable
Change the reference group in the regression model
Add a continuous covariate to the regression model
Visualize the regression in a two-dimensional plot.
Run a multiple regression with an interaction between two continuous variables
Interpret the results
Visualize your model in 2D
Run a multiple regression with an interaction between a continuous and a categorical variable
Interpret the results
Visualize your model in 2D
We’re going to be using the workout dataset
STATA
// NB change profile.do
use workout.dta, clear
In a mediation analysis we have an independent variable \(X\)
A mediating variable \(M\)
And a dependent variable \(Y\)
In mediation we have the effects \(X\mapsto M\) (a) and \(M\mapsto Y\) (b), and the total effect \(c\), where \(X \mapsto Y\)
Lastly we have the direct effect \(c'\), where \(X \mapsto Y\) while controlling for \(M\); the indirect effect is the part of \(c\) that goes through \(M\)
These relationships are assessed through four steps
NB: Note that the language implies causation, but we really need experimental data to be sure
Simple regressions assessing \(c\), \(X \mapsto Y\):
\(Y = \beta_0 + \beta_1 X_{i} + e_i\)
Simple regression assessing \(X\mapsto M\) (effect a):
\(M = \beta_0 + \beta_1X_{i} +e_i\)
Simple regression assessing \(M\mapsto Y\) (effect b):
\(Y = \beta_0 + \beta_1 M_{i} + e_i\)
Multiple regression assessing \(c'\):
\(Y = \beta_0 + \beta_1 X_{i} + \beta_2 M_{i} + e_i\)
Delta (difference in \(\beta_1\) between step 1 and 4)
\[ \beta_{indirect} = \beta_{step1} - \beta_{step4} \]
Sobel product (product of \(\beta\) from \(X\mapsto M\) and \(M\mapsto Y\) )
(i.e., this is the same approach as calculating a correlation matrix/covariance matrix from path analysis, where we multiply the effect of \(X\) on \(M\), with the partial effect of \(M\) on \(Y\))
\[ \beta_{indirect} = \beta_{X\mapsto M} \cdot \beta_{partial\,M\mapsto Y} \]
Y = calories
X = attract
M = appear
Step 1
step_1 <- lm(calories ~ attract, data = workout)
preg(step_1)
Model fit: calories
# Color data frame (class colorDF) 6 x 3:
│SS │df │M. │M │R │R.
Model│ 57│ 1│F(1, 202)│23.51│ R-squared│0.1043
Residual│ 490│ 202│ prob > F│2e-06│adj R-squared│0.0998
Total│ 547│ 203│ N│ 204│ MSE│1.5571
Coefficients: calories
# Color data frame (class colorDF) 6 x 2:
│ Coefficients│ std.error│ t-value│ p-value│ CI[2.5%]│ CI[97.5%]
(Intercept)│ 3.08│ 0.224│ 13.76│ < 2e-16│ 2.643│ 3.527
attract│ 0.31│ 0.064│ 4.85│2.48e-06│ 0.184│ 0.437
Source | SS df MS Number of obs = 204
-------------+---------------------------------- F(1, 202) = 23.51
Model | 57.0014805 1 57.0014805 Prob > F = 0.0000
Residual | 489.758323 202 2.42454616 R-squared = 0.1043
-------------+---------------------------------- Adj R-squared = 0.0998
Total | 546.759804 203 2.69339805 Root MSE = 1.5571
------------------------------------------------------------------------------
calories | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
attract | .3104855 .0640344 4.85 0.000 .1842239 .4367471
_cons | 3.084593 .2241653 13.76 0.000 2.642589 3.526597
------------------------------------------------------------------------------
We can see that the effect of attract on calories is significant
Step 2
step_2 <- lm(appear ~ attract, data = workout)
preg(step_2)
Model fit: appear
# Color data frame (class colorDF) 6 x 3:
│SS │df │M. │M │R │R.
Model│ 421│ 1│F(1, 201)│587.74│ R-squared│0.745
Residual│ 144│ 201│ prob > F│ 1e-61│adj R-squared│0.744
Total│ 565│ 202│ N│ 203│ MSE│0.846
Coefficients: appear
# Color data frame (class colorDF) 6 x 2:
│ Coefficients│ std.error│ t-value│ p-value│ CI[2.5%]│ CI[97.5%]
(Intercept)│ 0.783│ 0.122│ 6.42│9.45e-10│ 0.543│ 1.024
attract│ 0.847│ 0.035│ 24.24│ < 2e-16│ 0.778│ 0.915
Source | SS df MS Number of obs = 203
-------------+---------------------------------- F(1, 201) = 587.74
Model | 421.035858 1 421.035858 Prob > F = 0.0000
Residual | 143.988773 201 .716362054 R-squared = 0.7452
-------------+---------------------------------- Adj R-squared = 0.7439
Total | 565.024631 202 2.79715164 Root MSE = .84638
------------------------------------------------------------------------------
appear | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
attract | .8465505 .0349188 24.24 0.000 .7776963 .9154047
_cons | .7831785 .1219268 6.42 0.000 .5427588 1.023598
------------------------------------------------------------------------------
We also see that the effect of attract on appear is significant
Step 3
step_3 <- lm(calories ~ appear, data = workout)
preg(step_3)
Model fit: calories
# Color data frame (class colorDF) 6 x 3:
│SS │df │M. │M │R │R.
Model│ 93.8│ 1│F(1, 201)│41.97│ R-squared│0.173
Residual│449.1│ 201│ prob > F│7e-10│adj R-squared│0.169
Total│542.9│ 202│ N│ 203│ MSE│1.495
Coefficients: calories
# Color data frame (class colorDF) 6 x 2:
│ Coefficients│ std.error│ t-value│ p-value│ CI[2.5%]│ CI[97.5%]
(Intercept)│ 2.654│ 0.236│ 11.24│ <2e-16│ 2.188│ 3.120
appear│ 0.407│ 0.063│ 6.48│ 7e-10│ 0.283│ 0.531
Source | SS df MS Number of obs = 203
-------------+---------------------------------- F(1, 201) = 41.97
Model | 93.7686694 1 93.7686694 Prob > F = 0.0000
Residual | 449.108178 201 2.23436904 R-squared = 0.1727
-------------+---------------------------------- Adj R-squared = 0.1686
Total | 542.876847 202 2.68750915 Root MSE = 1.4948
------------------------------------------------------------------------------
calories | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
appear | .4073758 .0628845 6.48 0.000 .2833778 .5313738
_cons | 2.654002 .23616 11.24 0.000 2.188333 3.119671
------------------------------------------------------------------------------
Since appear also has a significant effect on calories, we can move on to step 4
Step 4
step_4 <- lm(calories ~ attract + appear, data = workout)
preg(step_4)
Model fit: calories
# Color data frame (class colorDF) 6 x 3:
│SS │df │M. │M │R │R.
Model│ 97.3│ 2│F(2, 200)│21.83│ R-squared│0.179
Residual│445.6│ 200│ prob > F│3e-09│adj R-squared│0.171
Total│542.9│ 202│ N│ 203│ MSE│1.493
Coefficients: calories
# Color data frame (class colorDF) 6 x 3:
│ Coefficients│ std.error│ t-value│ p-value│ CI[2.5%]│ CI[97.5%]
(Intercept)│ 2.667│ 0.236│ 11.30│ < 2e-16│ 2.202│ 3.133
attract│ -0.153│ 0.122│ -1.25│ 0.212│ -0.393│ 0.088
appear│ 0.542│ 0.124│ 4.36│2.12e-05│ 0.296│ 0.787
Source | SS df MS Number of obs = 203
-------------+---------------------------------- F(2, 200) = 21.83
Model | 97.2580367 2 48.6290183 Prob > F = 0.0000
Residual | 445.618811 200 2.22809405 R-squared = 0.1792
-------------+---------------------------------- Adj R-squared = 0.1709
Total | 542.876847 202 2.68750915 Root MSE = 1.4927
------------------------------------------------------------------------------
calories | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
attract | -.1526638 .1219915 -1.25 0.212 -.3932183 .0878907
appear | .5417558 .1243949 4.36 0.000 .2964621 .7870496
_cons | 2.667387 .2360706 11.30 0.000 2.201881 3.132894
------------------------------------------------------------------------------
M is significant, which points towards mediation. Since attract has become non-significant, this suggests full mediation.
NB: There is a popular package for this in R, but it requires an older version of R
Delta
delta <- step_1$coefficients[2] - step_4$coefficients[2]
delta
attract
0.4631493
Sobel
alpha <- step_2$coefficients[2]
beta <- step_4$coefficients[3]
sobel <- alpha * beta
sobel
attract
0.4586237
We can calculate the standard error of the Sobel estimate, which in turn gives us a z-statistic and a confidence interval
std.error_alpha <- summary(step_2)[["coefficients"]][2,2]
std.error_beta <- summary(step_4)[["coefficients"]][3,2]
sobel_se <- sqrt(alpha^2*std.error_beta^2 +
beta^2*std.error_alpha^2 +
std.error_alpha^2*std.error_beta^2)
sobel_se
attract
0.1070804
CI_low = sobel - sobel_se*1.96
CI_high = sobel + sobel_se*1.96
cat("CI_low =", CI_low, "CI_high =", CI_high)
CI_low = 0.2487462 CI_high = 0.6685012
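Dividing the Sobel estimate by its standard error gives a z-statistic and p-value (a sketch, plugging in the values printed above):

```r
# Sobel z-test: indirect estimate divided by its standard error
sobel_est <- 0.4586237
se_est    <- 0.1070804
z <- sobel_est / se_est
p <- 2 * pnorm(-abs(z))
c(z = z, p = p)   # z is roughly 4.3, so the indirect effect is significant
```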
# install.packages("lavaan", dependencies = T)
# devtools::install_github("ihrke/rmedsem")
library(rmedsem)
library(lavaan)
This is lavaan 0.6-15
lavaan is FREE software! Please report any bugs.
// net install github, from("https://haghish.github.io/github/")
// github install mmoglu/medsem
Predict: ~ (e.g., Y ~ X)
Indicator: =~ (e.g., latent variable =~ measured/observed variables)
~~ covariance (e.g., X ~~ X)
~1 intercept or mean (e.g., X ~ 1 estimates the mean of variable X)
1* fixes a parameter/loading to one (useful for single-item constructs)
NA* frees a parameter/loading (useful to override the default marker method, e.g., f =~ NA*q)
a* labels the parameter ‘a’, used for model constraints (e.g., f =~ a*q)
model <- "
appear ~ attract
calories ~ attract + appear
"
sem_1 <- sem(model, data = workout)
summary(sem_1)
lavaan 0.6.15 ended normally after 1 iteration
Estimator ML
Optimization method NLMINB
Number of model parameters 5
Used Total
Number of observations 203 246
Model Test User Model:
Test statistic 0.000
Degrees of freedom 0
Parameter Estimates:
Standard errors Standard
Information Expected
Information saturated (h1) model Structured
Regressions:
Estimate Std.Err z-value P(>|z|)
appear ~
attract 0.847 0.035 24.364 0.000
calories ~
attract -0.153 0.121 -1.261 0.207
appear 0.542 0.123 4.388 0.000
Variances:
Estimate Std.Err z-value P(>|z|)
.appear 0.709 0.070 10.075 0.000
.calories 2.195 0.218 10.075 0.000
(43 observations with missing values excluded)
Endogenous variables
Observed: calories appear
Exogenous variables
Observed: attract
Fitting target model:
Iteration 0: log likelihood = -1016.9388
Iteration 1: log likelihood = -1016.9388
Structural equation model Number of obs = 203
Estimation method = ml
Log likelihood = -1016.9388
------------------------------------------------------------------------------
| OIM
| Coef. Std. Err. z P>|z| [95% Conf. Interval]
-------------+----------------------------------------------------------------
Structural |
calories |
appear | .5417558 .1234723 4.39 0.000 .2997546 .783757
attract | -.1526638 .1210867 -1.26 0.207 -.3899894 .0846618
_cons | 2.667387 .2343198 11.38 0.000 2.208129 3.126646
-----------+----------------------------------------------------------------
appear |
attract | .8465505 .0347464 24.36 0.000 .7784488 .9146522
_cons | .7831785 .1213247 6.46 0.000 .5453865 1.020971
-------------+----------------------------------------------------------------
var(e.calo~s)| 2.195167 .2178886 1.807085 2.66659
var(e.appear)| .7093043 .0704044 .5839071 .8616312
------------------------------------------------------------------------------
LR test of model vs. saturated: chi2(0) = 0.00, Prob > chi2 = .
Structural equation model Number of obs = 203
Estimation method = ml
Log likelihood = -1016.9388
------------------------------------------------------------------------------
| OIM
Standardized | Coef. Std. Err. z P>|z| [95% Conf. Interval]
-------------+----------------------------------------------------------------
Structural |
calories |
appear | .5526964 .1197077 4.62 0.000 .3180737 .7873191
attract | -.1588152 .1253989 -1.27 0.205 -.4045925 .0869621
_cons | 1.631111 .1948458 8.37 0.000 1.24922 2.013001
-----------+----------------------------------------------------------------
appear |
attract | .8632287 .0141675 60.93 0.000 .835461 .8909964
_cons | .4694346 .0857522 5.47 0.000 .3013634 .6375058
-------------+----------------------------------------------------------------
var(e.calo~s)| .820847 .0480668 .7318429 .9206754
var(e.appear)| .2548363 .0244595 .2111358 .3075818
------------------------------------------------------------------------------
LR test of model vs. saturated: chi2(0) = 0.00, Prob > chi2 = .
If both or one of the X->M and M->Y coefficients is not significant, there is no mediation
When both of the X->M and M->Y coefficients are significant, there is “some” mediation
If Sobel’s z-test is significant and the X->Y coefficient is not significant, then there is complete mediation
If both Sobel’s z-test and the X->Y coefficient are significant, then there is partial mediation
If Sobel’s z-test is not significant but the X->Y coefficient is significant, then there is partial mediation
If neither Sobel’s z-test nor the X->Y coefficient is significant, then there is no mediation
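The decision rules above can be sketched as a small helper function (a hypothetical illustration, not part of medsem/rmedsem):

```r
# Classify mediation following the Baron & Kenny style rules above.
# p_xm: p-value of X -> M; p_my: M -> Y; p_xy: direct X -> Y;
# p_sobel: Sobel's z-test; alpha: significance level.
classify_mediation <- function(p_xm, p_my, p_xy, p_sobel, alpha = 0.05) {
  if (p_xm >= alpha || p_my >= alpha) return("no mediation")
  if (p_sobel < alpha && p_xy >= alpha) return("complete mediation")
  if (p_xy < alpha) return("partial mediation")
  "no mediation"
}

# Using the p-values from the model above (X -> M and M -> Y both < .001,
# direct effect p = 0.207, Sobel p < .001):
classify_mediation(0.000, 0.000, 0.207, 0.000)  # -> "complete mediation"
```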
med_1 <- rmedsem(sem_1,
indep = "attract",
med = "appear",
dep = "calories",
approach = c("bk", "zlc"))
med_1
Significance testing of indirect effect (unstandardized)
Mediation effect: 'attract' -> 'appear' -> 'calories'
Baron and Kenny approach to testing mediation
Zhao, Lynch & Chen's approach to testing mediation
STEP 1 - 'attract:calories' (X -> Y) with B=-0.153 and p=0.207
As the Monte Carlo test above is significant and STEP 1 is not
significant, there is indirect-only mediation (full mediation).
Effect sizes
RIT = (Indirect effect / Total effect)
(0.459/0.306) = 1.499
Meaning that about 150% of the effect of 'attract'
on 'calories' is mediated by 'appear'
RID = (Indirect effect / Direct effect)
(0.459/0.153) = 3.004
That is, the mediated effect is about 3.0 times as
large as the direct effect of 'attract' on 'calories'
(43 observations with missing values excluded)
Endogenous variables
Observed: calories appear
Exogenous variables
Observed: attract
Fitting target model:
Iteration 0: log likelihood = -1016.9388
Iteration 1: log likelihood = -1016.9388
Structural equation model Number of obs = 203
Estimation method = ml
Log likelihood = -1016.9388
------------------------------------------------------------------------------
| OIM
| Coef. Std. Err. z P>|z| [95% Conf. Interval]
-------------+----------------------------------------------------------------
Structural |
calories |
appear | .5417558 .1234723 4.39 0.000 .2997546 .783757
attract | -.1526638 .1210867 -1.26 0.207 -.3899894 .0846618
_cons | 2.667387 .2343198 11.38 0.000 2.208129 3.126646
-----------+----------------------------------------------------------------
appear |
attract | .8465505 .0347464 24.36 0.000 .7784488 .9146522
_cons | .7831785 .1213247 6.46 0.000 .5453865 1.020971
-------------+----------------------------------------------------------------
var(e.calo~s)| 2.195167 .2178886 1.807085 2.66659
var(e.appear)| .7093043 .0704044 .5839071 .8616312
------------------------------------------------------------------------------
LR test of model vs. saturated: chi2(0) = 0.00, Prob > chi2 = .
Significance testing of indirect effect (unstandardised)
+--------------------------------------------------------------------------+
Estimates | Delta | Sobel | Monte Carlo
|--------------------------------------------------------------------------|
Indirect effect | 0.459 | 0.459 | 0.466
Std. Err. | 0.106 | 0.106 | 0.113
z-value | 4.318 | 4.318 | 4.138
p-value | 0.000 | 0.000 | 0.000
Conf. Interval | 0.250 , 0.667 | 0.250 , 0.667 | 0.276 , 0.686
|--------------------------------------------------------------------------|
Baron and Kenny approach to testing mediation
STEP 1 - appear:attract (X -> M) with B=0.847 and p=0.000
STEP 2 - calories:appear (M -> Y) with B=0.542 and p=0.000
STEP 3 - calories:attract (X -> Y) with B=-0.153 and p=0.207
As STEP 1, STEP 2 and the Sobel's test above are significant
and STEP 3 is not significant the mediation is complete!
+--------------------------------------------------------------------------+
Note: to read more about this package help medsem
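The RIT and RID effect sizes reported above can be reproduced by hand from the unstandardised coefficients (a sketch; the absolute values follow the package's convention, and since the direct effect is negative here the "proportion mediated" exceeds 100%):

```r
# Reproduce rmedsem's effect sizes from the printed coefficients
a <- 0.8465505    # attract -> appear
b <- 0.5417558    # appear  -> calories
c <- -0.1526638   # attract -> calories (direct effect)

indirect <- a * b
total    <- indirect + c

RIT <- abs(indirect / total)  # indirect / total  -> ~1.499
RID <- abs(indirect / c)      # indirect / direct -> ~3.004
c(RIT = RIT, RID = RID)
```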
#install.packages("tidySEM")
library(tidySEM)
graph_sem(sem_1)